Patent abstract:
A recurrent semantic segmentation system, article and method for image processing that takes historical semantic segmentation into account.
Publication number: BR102018075714A2
Application number: R102018075714-8
Filing date: 2018-12-11
Publication date: 2019-07-30
Inventors: Shahar Fleishman; Naomi Ken Korem; Mark Kliger
Applicant: Intel Corporation
IPC main classification:
Patent description:

BACKGROUND [0001] Computer vision provides visual capabilities to computers or automated machines. Thus, it is desirable in computer vision to provide these systems with the ability to reason about the physical world, being able to understand what is being seen in 3D and in images captured by cameras, for example. In other words, applications in robotics, virtual reality (VR), augmented reality (AR) and mixed reality (MR) may need to understand the world around the robot or around the person providing the point of view in the applications. For example, a robot needs to understand what it sees in order to manipulate (grab, move, etc.) objects. VR, AR or MR applications need to understand the world around the person providing the point of view so that, when the person moves in that world, the person can avoid obstacles in that world, for example. This capability also allows these computer vision systems to add semantically plausible virtual objects to the world environment. In this way, a system that understands that it is seeing a lamp can understand the lamp's purpose and operation. For these purposes, a 3D semantic representation of the world can be formed in the form of a semantic segmentation model (or simply semantic model) using 3D semantic segmentation techniques.
[0002] These semantic segmentation techniques often involve building a 3D geometric model, and then building a 3D semantic model based on the geometric model, where the 3D semantic model is formed by voxels that are each assigned a definition of the object of which those voxels are part in a 3D space, such as furniture (a chair, sofa or table) or parts of the room (the floor or a wall), etc. The 3D semantic model is updated over time by segmenting a current frame to form a segmented frame, and registering the segmented frame in the model based on heuristic rules or a Bayesian update, as well as the current camera pose used to form the current frame. The semantic model can then be used by different applications, such as computer vision, to perform tasks or analyze the 3D space as described above.
[0003] However, this update of the semantic segmentation model is often inaccurate and results in poor performance, as it does not adequately take into account the history of the semantic update. In other words, semantic segmentation is often updated one frame at a time. A current frame is semantically segmented to form a segmented or labeled frame, and this is repeated for individual current frames in a video sequence. Each semantically segmented frame, depending on the pose of a camera (or a sensor) used to form the current frame, is then used to update the semantic model. This is typically done without taking into account the sequence or history of semantic updates that occurred earlier during a video sequence while semantic segmentation of the current frame is performed to form the segmented frame. This results in a significantly less accurate analysis, leading to errors and inaccuracies in the semantic assignments to the vertices or voxels in the semantic model.
DESCRIPTION OF THE FIGURES [0004] The material described here is illustrated as an example and not as a limitation in the attached figures. For simplicity and clarity of illustration, the elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. In addition, where deemed appropriate, reference labels have been repeated among the figures to indicate corresponding or similar elements. In the figures: [0005] Figure 1 is a schematic flowchart showing a conventional semantic segmentation method;
[0006] Figure 2 is an illustration of an image with geometric segmentation;
[0007] Figure 3 is an illustration of a semantic segmentation of an image by labeled classifications;
[0008] Figure 4 is a flowchart of a method of semantic segmentation of images according to the implementations here;
[0009] Figures 5A and 5B are a detailed flow chart of a method of semantic segmentation of images according to the implementations here;
[0010] Figure 6 is a schematic diagram of a system for performing semantic segmentation of images according to the implementations here;
[0011] Figure 7 is a close view of a portion of an image with semantic segmentation according to the semantic segmentation implementations disclosed here;
[0012] Figure 8 is a unit of semantic segmentation of the system of Figure 6 according to the implementations of semantic segmentation disclosed here;
[0013] Figure 9 is an illustrative diagram of an example system;
[0014] Figure 10 is a diagram illustrating another example system; and [0015] Figure 11 illustrates another example device, all arranged according to at least some implementations of the present disclosure.
DETAILED DESCRIPTION [0016] One or more implementations are now described with reference to the included figures. Although specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. Those skilled in the art will recognize that other configurations and arrangements can be employed without departing from the spirit and scope of the description. It will be apparent to those skilled in the art that the techniques and/or arrangements described herein can also be employed in a variety of other systems and applications in addition to those described herein.
[0017] Although the following description presents several implementations that can manifest themselves in architectures such as system-on-a-chip (SoC) architectures, the implementation of the techniques and/or arrangements described here is not restricted to particular architectures and/or computing systems and they can be implemented by any architecture and/or computing system for similar purposes. For example, various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or various computing devices and/or consumer electronics (CE) devices, such as imaging devices, digital cameras, smartphones, webcams, video game panels or consoles, television set-top boxes, tablets, etc., any of which can have sensors and/or light projectors to perform object detection, depth measurement and other tasks, can implement the techniques and/or arrangements described herein. Furthermore, although the following description may present numerous specific details, such as logical implementations, types and interrelationships of system components, options for logical integration/partitioning, etc., the claimed subject matter can be practiced without these specific details. In other instances, some material, such as complete control structures and software instruction sequences, may not be shown in detail so as not to obscure the material disclosed here. The material disclosed here can be implemented in hardware, firmware, software or any combination thereof.
[0018] The material disclosed here can also be implemented as instructions stored in a memory or a machine-readable medium, which can be read and executed by one or more processors. A machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (for example, a computing device). For example, a machine-readable medium may include read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustic or other forms of propagated signals (for example, carrier waves, infrared signals, digital signals, etc.) and others. In another form, a non-transitory article, such as a non-transitory computer-readable medium, can be used with any of the examples mentioned above or other examples, except that it does not include a transitory signal per se. It does include elements other than a signal per se that can temporarily retain data in a transient manner, such as RAM, etc.
[0019] References in the specification to an implementation, an example implementation, etc., indicate that the implementation described may include a particular feature, structure or characteristic, but not every implementation necessarily includes that particular feature, structure or characteristic. Furthermore, these expressions do not necessarily refer to the same implementation. Furthermore, when a particular feature, structure or characteristic is described in connection with an implementation, it is submitted that it is within the knowledge of one skilled in the art that such a feature, structure or characteristic can be carried out in connection with other implementations, whether or not explicitly described here.
[0020] Systems, articles and methods to provide recurrent semantic segmentation for image processing.
[0021] As mentioned, computer vision is often used to reason about the physical world. Applications in robotics, virtual reality (VR), augmented reality (AR) and mixed reality (MR) may need to understand the world around the camera sensor, whether the camera sensor is on a robot or provides the point of view (POV) of a user. For example, a robot may need to understand what it sees in order to manipulate (grab, move, etc.) objects. VR/AR/MR applications may need to understand the world in order to avoid obstacles as the user moves, and to add semantically plausible virtual objects to the environment. For this purpose, a 3D semantic representation of the world in the form of a 3D semantic model can be used.
[0022] Regarding Figure 1, some existing solutions perform 3D semantic segmentation using red-green-blue depth (RGBD) cameras that provide luminance and color image data, as well as image depth maps. This can include taking images from a single (monocular) camera moving around a scene, or stereo systems that use multiple cameras to capture the same scene from different angles. Generally, 3D semantic segmentation can first include building a 3D geometric model from the image and depth data and then registering semantically segmented images in the geometric model to form a 3D semantic model. Thus, in relation to Figure 1, a 3D semantic process 100 can be performed in three main operations: (1) a first geometric stage or direct channel 102 that uses a dense RGBD simultaneous localization and mapping (RGBD-SLAM) algorithm for 3D geometric reconstruction, (2) a second stage 104 that performs semantic segmentation based on a current frame, and (3) a third semantic model update stage 106 that performs an update of the 3D semantic model (based on heuristic rules or a Bayesian update).
[0023] Regarding the 3D geometric model generation operation 102, it can have two parts: (i) finding the camera position of individual frames and (ii) mapping the environment around the camera. According to an example form, and initially, this may involve obtaining an input depth image or depth map 108 of the current frame formed by triangulation, for example, in addition to chroma or luminance data, such as RGB data, generally referred to here as the image data of the current frame, or the current frame itself. Depth map 108 and the current frame can then be used to form a 3D geometric model 110 of the scene being captured by the camera(s). The 3D geometric model 110 can be a stored 3D volumetric grid, for example. First, a 3D rendered image 112 of the model 110 can be rendered by ray casting from a previously known camera position (pose k-1) onto an image plane of model 110. The previously known camera position can be the pose of a first or immediately previous frame in relation to the current frame being analyzed. Thereafter, the new individual current frames 114, or each of them, in pose k are registered to one of the rendered images 112 in pose k-1 (or previous) of model 110 to compute a new camera position or pose estimate 116. Specifically, the current frame 114 can be registered to the rendering 112 of the model using an iterative closest point (ICP) algorithm. ICP algorithms can be used to compute a rigid transformation (rotation and translation) between the model rendering 112 and the current frame 114. This transformation is then applied to the previously known camera position (pose k-1) 112 to obtain the new camera position (New Pose Est.) 116. Given the estimated new camera position 116, the 3D location of each pixel in the world is known for this new position 116. Then, geometric model 110 can be updated with point or vertex position data from the new pose estimate 116. In turn, the volumetric semantic representation or semantic model 122 can be updated in stages 2 and 3 (104 and 106) as described below.
[0024] To perform the 3D geometric construction, several different RGBD-SLAM algorithms can be used. A dense RGBD-SLAM algorithm can use a 3D reconstruction algorithm that builds a 3D model incrementally, referring to the addition of increments, or 3D sections, to the 3D geometric model 110 one at a time, which can be supplied by different frames one at a time. In this way, new 3D points in each arriving RGBD frame are registered in the existing model, and the model is updated in an incremental way. See, for example, Newcombe et al., KinectFusion: Real-time dense surface mapping and tracking, ISMAR (pp. 127 to 136), IEEE Computer Society (2011); and Finman et al., Efficient Incremental Map Segmentation in Dense RGB-D Maps, Proc. Int. Conf. on Robotics and Automation (ICRA), p. 5488 to 5494 (2014). In this case, dense refers to the relatively large number of points that can be used to build the model. The 3D geometric model can be represented as a volumetric grid using a Signed-Distance-Function (SDF) method. See Curless et al., A Volumetric Method for Building Complex Models from Range Images, SIGGRAPH (1996). In these dense incremental RGBD-SLAM algorithms, 3D reconstruction can be performed in real time, typically keeping a model of the reconstructed scene locally in the memory of a device, such as a smartphone or AR headset, rather than remotely. However, dense RGBD-SLAM alone does not retrieve any semantic information about the model, and semantic segmentation typically involves very large computational loads, so that it is usually performed remotely from small devices.
[0025] Particularly, regarding the second stage (the semantic segmentation) 104, the semantic information can be captured with a semantic segmentation algorithm. Usually, these semantic algorithms segment a single frame at a time, and the semantic segmentation of temporal data, such as RGB or RGBD videos, is usually not based on 3D information. However, and as exemplified by the 3D segmentation process 100, an example framework combines a dense SLAM algorithm with a 3D semantic segmentation algorithm. See Tateno et al., Real-time and scalable incremental segmentation on dense SLAM, IROS (2015). This semantic segmentation process creates a 3D semantic model 122 that maintains the semantic tag of each voxel.
[0026] Specifically, Tateno discloses projecting the current global model 122 to form a global model label map. Meanwhile, a current frame is used to form a current depth map, that current depth map is segmented, and the segments are semantically tagged. According to one form, this includes the use of connected component analysis algorithms on a depth map edge map, which can generate a current label map. This process then compares the current label map with the global model label map, finding the segments that have the same semantic label, and comparing them to form a propagated label map. The propagated tag map can be modified by merging adjacent segments with the same tag.
[0027] In the third stage update 106, for systems that perform semantic segmentation without 3D data, the rendering or new pose estimate 116 is then compared with (or registered to) the semantic segmentation frame 120 formed using the current frame 118, and model 122 is updated by a heuristic (or voting), Bayesian or similar method. In Tateno, where 3D data is used, the update is carried out using the propagated tag map to update the global model depending on an accumulated confidence score for the semantic tags of each segment.
[0028] Initially, Tateno used geometric plane-based segmentation that was later replaced by a deep-learning semantic segmentation algorithm. See Tateno et al., CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction, arXiv preprint arXiv:1704.03489 (2017). In McCormac et al., this idea is expanded, and confidence scores are maintained per semantic class instead of maintaining a single semantic tag for each voxel. See McCormac et al., SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks, arXiv preprint arXiv:1609.05130 (2016). Per-class confidence scores are updated quickly whenever new observations are available. As mentioned for these Tateno systems, the semantic segmentation algorithm is applied to a single image 118 and then propagated to model 122 to update the model.
[0029] These conventional 3D segmentation systems have several difficulties. For systems that merely analyze one frame at a time independently, semantic segmentation is often inadequate, mainly because an insufficient amount of data results in noise and errors, such as jumps, inconsistency and incorrect segment labels, as a video of frames is being analyzed for semantic segmentation. These systems are unable to adequately handle significant changes in voxel data due to variations in image data over time, such as with videos of moving objects or fast-moving cameras.
[0030] Furthermore, since the conventional semantic segmentation of the current frame does not take into account the semantic segmentation of previous frames, the global model update, if present at all, as for example in Tateno, is limited to local analysis where a current segment tag is compared to a global tag as described above with respect to Tateno. It does not have the ability to perform a global historical analysis (in this case, global referring to an entire frame) to determine whether changes in the distribution of image data over an entire frame indicate certain semantic tags for a specific segment. These distributions can capture when certain labels are seen together in the same frame, such as an automobile and a road, or a chair and a table, thereby substantially increasing the efficiency of classification. Thus, if a frame has image data that varies over time (frame by frame) in the conventional system, thus changing the distribution of image data over large areas of a frame over time, that data does not necessarily have an effect on the semantic tag probabilities for semantic segmentation of a specific segment at a current location in the frame in the Tateno system. This results in significantly less accurate semantic segmentation, as these known systems are unable to label accurately when large unknown variations in segment data occur as mentioned above.
[0031] Furthermore, in the existing solutions, the update step is usually based on deterministic rules and is slow to adapt, or is unable to adapt at all, to specific scenarios or to data of particular situations. Similar to the above, these rules are limited to local rather than global analysis. For example, semantic segmentation is often restricted to heuristic or Bayesian operations when using a current pose to modify a semantically segmented frame to update the semantic model. However, these rules do not consider data from other segments or the entire frame and over time, so that the distribution of data over wide areas of a frame over time is not considered. As such, these strict rules also fail to take into account relatively large variations in segment data, including variations over time (frame by frame), which could be used to form more accurate labels.
[0032] Likewise, since more data is added to the 3D semantic model whenever a frame is added, the computational complexity and the size of the network being handled to perform the semantic segmentation may be too large to be handled on a small device, due to the limited memory, processing power and power capacity of these devices, when the system analyzes the entire 3D semantic model to perform 3D semantic segmentation. As such, most 3D segmentation is done offline, resulting in an unreasonable delay in transmissions to and from such a device. As such, these systems are not suitable for real-time applications on small devices. The functioning of these devices or computers can be improved with more efficient semantic segmentation that reduces the computational load and, consequently, the memory and power capacity used for semantic segmentation, thus allowing the semantic segmentation to be performed on smaller devices. One such solution is an incremental algorithm provided by Finman et al., cited above, that discloses only the refinement of newly added sections of the model instead of the entire model. However, this solution is still inadequate, since the model can still grow too much with each frame added, and existing model data is not updated sufficiently, so that errors and inaccuracies occur.
[0033] Finally, since Tateno merges segments on the propagated tag map before updating the global model, it is not possible to reverse this process before providing a final tag and updating the global model. This can preserve segment tag inaccuracies when further analysis would reveal that two originally joined segments should have been kept separate instead of being joined.
[0034] To resolve these issues, a system and method that recurrently use historical semantic data to perform semantic segmentation of a current frame and to update a 3D semantic model are disclosed here. In this case, historical refers to the use of previous semantic tagging in the 3D semantic model. A rendered semantic segmentation map is generated by projection from the 3D semantic model to represent this historical data, and the segmentation map can be provided in the same pose as the current frame being analyzed. The rendered segmentation map can be provided for each frame being analyzed to establish recurrence.
[0035] In addition, a recurrent 3D semantic segmentation algorithm is used that takes as input the rendered semantic segmentation map of the model as well as an input image of the current frame. The recurrent 3D semantic segmentation algorithm can include a CNN-based architecture that receives this paired input to synergistically analyze the distribution of image data in the input together. For example, using the entire frame allows the system to know which classes appear together. When the system recognizes a table, it can easily recognize a chair, since a chair is expected to appear with a table, while the system eliminates other objects more easily (for example, there will probably be no horses in the image). The output of the system is an updated 3D representation (model) of the world with individual voxels of the model being classified semantically.
[0036] In the method and system disclosed here, the recurrent 3D semantic segmentation algorithm can merge efficient geometric segmentation with high-performance 3D semantic segmentation, for example by using dense SLAM (simultaneous localization and mapping) based on RGB-D data (RGB data and depth cameras, for example Intel RealSense depth sensors) and semantic segmentation with convolutional neural networks (CNNs) on a recurrent basis as described herein. Thus, 3D semantic segmentation can include: (i) dense RGBD-SLAM for 3D geometry reconstruction; (ii) recurrent CNN-based segmentation that receives as input the current frame and 3D semantic information from previous frames; and (iii) copying the results of (ii) to the 3D semantic model. It should be noted that operation (ii) uses the previous frames and performs both the segmentation and the update of the model in a single step. In other words, many conventional semantic segmentation systems perform a segmentation algorithm that obtains segment tags first, and then perform some kind of comparison or confidence value computation with thresholds to determine whether the semantic model should be updated with the semantic tags, as in Tateno for example. Instead, the present system performs a sufficiently precise analysis, with feature extraction and semantic segmentation neural networks, to obtain the semantic segment tags so that this second confidence value stage is unnecessary.
[0037] In the present solution, the recurrent segmentation operation (the use of semantic information from previous frames as reflected in the 3D semantic model) is learned from the data and adjusted to specific scenarios, and as a consequence this solution is more accurate when the data are changing rapidly and for a wider variety of image data scenarios. Thus, this recurrence results in a high-quality and computationally efficient 3D semantic segmentation system.
[0038] With respect to Figure 2 for example, an image 200 shows an exposed top view of a room 202 with a border 206 highlighted between the room 202 and a background 204 to show geometric segmentation. Room 202 has fixtures and furniture 208 that can also be geometrically segmented from each other and from the background.
[0039] Regarding Figure 3, an image 300 shows room 202 (now 301) of image 200, only now with the disclosed 3D semantic segmentation applied. Each voxel color or shade represents the class of the object to which it belongs. With this semantic segmentation, actions can be performed depending on the segment's semantic tag, either for computer vision or for other applications, such as virtual or augmented reality, for example.
[0040] In the disclosed system and method, these two tasks applied to room 202 are merged into a common framework that is computationally efficient and produces high-quality 3D semantic segmentation as described here. Furthermore, unlike typical semantic segmentation algorithms that operate on a single frame, the disclosed methods use an algorithm that takes advantage of the incremental nature of scanning an environment over time from multiple points of view in order to improve segmentation and build a 3D model that is augmented with semantic information. This semantic segmentation process can be an incremental, over-time semantic segmentation of an environment in which the input can be an RGB-D frame stream (RGB camera and depth sensor/camera streams).
[0041] Practical robotic and VR applications often need to work immediately as soon as they are started, gather information about the visible environment and incrementally improve the 3D semantic model over time. In addition, the algorithm must cope with changes in the environment, that is, objects, furniture and people moving in the environment. Consequently, an offline segmentation and reconstruction algorithm cannot solve this problem, and a real-time incremental algorithm should be used. This is accomplished using the semantic segmentation method described here, which effectively provides an incremental, over-time process as also disclosed here.
[0042] In the disclosed solution, the computation time per frame is a function of the number of pixels in that frame, and not of the entire history of frames or the model, and as a result this method is computationally efficient and can be executed in real time on small devices. In particular, the system, with its semantic segmentation network, receives merely a rendered semantic map and the current images from the same image point of view as the input to the system, which needs to be placed in memory for processor access during segmentation. This rendered map represents the history of the 3D semantic model. In contrast, conventional systems typically take the entire 3D semantic model as the input and place it in accessible memory, and consequently the whole history of the 3D semantic model. Thus, the input to the system disclosed here is much smaller than the insertion of the entire semantic model as an additional input to the semantic segmentation network. Since the input has a fixed frame size for both the current frame and the rendered segmentation map inputs, the computation time can be fixed and is independent of the model size. This further avoids an increase in computing time as the 3D semantic model grows, the growth being explained above, and avoids dependence on the increasing size of the model.
[0043] Finally, it will be noted that, since the segmentation map is used to form the segmentation frame in the first place, and by entering one or more additional neural network layers that consider the pixels individually, this process avoids the problems of permanently merging segments too early as in Tateno. [0044] Regarding Figure 4, a process 400 is provided for a recurrent semantic segmentation method and system for image processing. In the illustrated implementation, process 400 may include one or more operations, functions or actions 402 to 412 numbered with even numbers. As a non-limiting example, process 400 can be described herein with reference to the example image processing system 600 of Figure 6 or system 900 of Figure 9, where relevant.
[0045] Process 400 may include obtaining a video sequence of frames of image data and comprising a current frame 402. This operation may include obtaining raw or pre-processed image data with RGB, YUV or other color space values and luminance values for several frames of a video sequence. The color and luminance values can be provided in many different additional forms, such as gradients, histograms, etc. Pre-processing can include demosaicing, noise reduction, pixel linearization, shading compensation, resolution reduction, vignetting, and/or 3A-related operations including automatic white balance (AWB), automatic focus (AF) and/or automatic exposure (AE) modifications, etc.
[0046] This operation may also include obtaining depth data when depth data is used for the segmentation analysis. Depth image data can be determined by a stereo camera system, such as with RGBD cameras, which capture images of the same scene or a moving scene from multiple angles. The system can perform several computations to determine a 3D space for the scene in the image and the depth dimension for each point, pixel, feature or object in the image. Otherwise, other ways of determining three dimensions from a single camera are possible, such as time-of-flight and structured or coded light technologies.
[0047] Process 400 can optionally include recurrently generating a semantic segmentation map in a view of a current pose of the current frame, comprising obtaining data to form the semantic segmentation map from a 3D semantic segmentation model, in which individual semantic segmentation maps are each associated with a different current frame of the video sequence 404. This may include the generation of a 3D semantic segmentation model, based on a 3D geometric model (generated by using RGBD-SLAM, for example) with semantic tags registered in the model. Once established, the 3D semantic segmentation model can be projected onto an image plane to form a segmentation map with the semantic tags of the model pixels or voxels that fall in that plane. The image plane can be the plane formed by the camera pose of the current frame being analyzed. The 3D semantic segmentation model can be updated with semantic segment tags as each current frame is semantically analyzed, so that the 3D semantic model reflects or represents the history of the semantic segmentation of the 3D space represented by the 3D semantic model up to the current moment in time. Thus, in turn, the semantic segmentation map will also represent this segmentation history.
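For illustration only, the following is a minimal Python sketch (not part of the original disclosure) of rendering such a segmentation map by projecting labeled voxel centers of a 3D semantic model onto the image plane of the current camera pose; the pinhole projection, the voxel-center representation and all names are assumptions rather than the exact implementation described above:

    # Hedged sketch: project labeled voxel centers of a 3D semantic model
    # into the image plane at the current camera pose k.
    import numpy as np

    def render_segmentation_map(voxel_centers, voxel_labels, T_world_to_cam, K, H, W):
        """voxel_centers: (N, 3) world points; voxel_labels: (N,) class ids (numpy arrays)."""
        # Transform voxel centers into the current camera frame.
        pts_h = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
        cam = (T_world_to_cam @ pts_h.T).T[:, :3]
        in_front = cam[:, 2] > 0.0
        cam, labels = cam[in_front], voxel_labels[in_front]
        # Pinhole projection onto the image plane.
        u = np.round(K[0, 0] * cam[:, 0] / cam[:, 2] + K[0, 2]).astype(int)
        v = np.round(K[1, 1] * cam[:, 1] / cam[:, 2] + K[1, 2]).astype(int)
        valid = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        u, v, z, labels = u[valid], v[valid], cam[valid, 2], labels[valid]
        seg_map = np.full((H, W), -1, dtype=np.int32)   # -1 means no label rendered
        z_buf = np.full((H, W), np.inf)
        for ui, vi, zi, li in zip(u, v, z, labels):
            if zi < z_buf[vi, ui]:                      # keep the nearest voxel only
                z_buf[vi, ui] = zi
                seg_map[vi, ui] = li
        return seg_map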
[0048] Then, the process 400 may include extracting historically influenced semantic features from the semantic segmentation map 406, and consequently an extraction algorithm can be applied to the segmentation map. The extraction algorithm can include a neural network, such as a CNN, with one or more layers, and can be a ResNet neural network. The result of this extraction can be considered as historically influenced high-level, intermediate-value features that represent the semantic labeling in the segmentation map. In this case, the features are not semantic probability values or tag classes. These features, or feature values, can eventually be used as inputs (or to compute inputs) for another (or last) segmentation neural network that forms semantic probabilities for semantic classes on a pixel or other basis, such as segments. The features can be in the form of tensors of matrices, each formed by feature vectors, for example. More details are provided below. Since the segmentation map already has semantic values, this operation can also be referred to as a refinement segmentation.
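As one hedged illustration only, a historical feature extraction branch of the kind described above (a convolution followed by a ResNet-style building block with batch normalization) could be sketched in PyTorch as follows, with channel counts and kernel sizes chosen arbitrarily and not taken from this disclosure:

    # Hedged sketch of a historical-feature extraction branch (conv + one
    # residual building block with batch normalization).
    import torch
    import torch.nn as nn

    class HistoricalBranch(nn.Module):
        def __init__(self, num_classes, feat_ch=64):
            super().__init__()
            self.entry = nn.Conv2d(num_classes, feat_ch, kernel_size=3, padding=1)
            # One residual building block: conv-BN-ReLU-conv-BN plus identity skip.
            self.block = nn.Sequential(
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.BatchNorm2d(feat_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.BatchNorm2d(feat_ch))

        def forward(self, seg_map_tensor):   # (B, C, H, W) class-probability input
            x = self.entry(seg_map_tensor)
            return torch.relu(x + self.block(x))   # high-level historical features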
[0049] Likewise, process 400 may include extracting current semantic features from the current frame 408, and this may include a current feature extraction algorithm, which may also correspond to one or more neural network layers. This may operate on 3D or 2D data depending on the algorithm, and the results here can also be in the form of tensors of semantic feature vectors, where the features are high-level or intermediate values rather than semantic probabilities or classes.
[0050] Next, process 400 may include generating a current and historical semantically segmented frame comprising the use of both the current semantic features and the historically influenced semantic features as input to a neural network that indicates the semantic tags for areas of the current-historical semantically segmented frame 410. This can occur in several different forms as long as both the current semantic features and the historically influenced semantic features are introduced for analysis together, such as entering a neural network, such as a CNN. In this way, the current semantic features and the historically influenced semantic features can be combined, or concatenated according to one example, before being introduced into the neural network together. This may include the concatenation of the features in the form of feature vectors, or of matrices or tensors that include the feature vectors. According to one form, a large tensor is formed with each concatenation of data of a 3D section of the outputs of the segmentation map and the current images. In this example, the feature vectors from the two different sources are put together in a single large feature vector and represent the same corresponding pixel locations in the current segmentation map and images. Thus, this operation can also involve matching the features of the current frame and the segmentation map to perform the concatenation, and this can be done automatically simply by the order in which the data is provided to the system.
[0051] The concatenated data is then inserted into another or last segmentation neural network (or CNN, according to one example) with one or more layers. In this way, the distribution of data in a frame, and in both the current and historical semantic data, is analyzed together, resulting in very precise semantic segmentation. This output forms a segmentation frame of tags, or probabilities, for segments in the frame.
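For illustration, a minimal PyTorch sketch of such a final segmentation head, which concatenates the current and historical feature tensors along the channel dimension and outputs per-pixel class probabilities, might look as follows; the layer sizes are assumptions and not part of this disclosure:

    # Hedged sketch of the combination and segmentation-output stage: channel
    # concatenation followed by a small convolutional classification head.
    import torch
    import torch.nn as nn

    class SegmentationHead(nn.Module):
        def __init__(self, cur_ch, hist_ch, num_classes):
            super().__init__()
            self.head = nn.Sequential(
                nn.Conv2d(cur_ch + hist_ch, 128, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(128, num_classes, kernel_size=1))   # per-class logits

        def forward(self, current_feats, historical_feats):
            # Concatenate features of the same pixel locations from both branches.
            fused = torch.cat([current_feats, historical_feats], dim=1)
            logits = self.head(fused)
            return torch.softmax(logits, dim=1)   # per-pixel semantic probabilities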
[0052] Process 400 may include semantically updating the 3D semantic segmentation model, including the use of the current and historical semantically segmented frame 412, which refers to the registration of the semantic tags or probabilities of the segmentation frame in the 3D semantic model. This can be done by first projecting the segmentation frame segments onto an image plane from the perspective of a new pose estimate from the geometric side of the system. Once in this new pose estimate, this image and its semantic data are placed in the appropriate corresponding location in the 3D semantic model. This effectively registers the input RGBD frame in the 3D semantic model. Details for determining the new pose estimate are provided below.
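A simplified, assumption-laden Python sketch of this registration step is shown below: each labeled pixel of the segmented frame is back-projected with the depth map and the new pose estimate, and its class probabilities are written to the matching voxel of the 3D semantic model. The voxel-grid indexing and the overwrite policy are illustrative only:

    # Hedged sketch: register the current-historical segmented frame into the
    # 3D semantic model by back-projecting labeled pixels at the new pose.
    import numpy as np

    def update_semantic_model(voxel_probs, seg_probs, depth, T_cam_to_world, K,
                              voxel_size, grid_origin):
        """voxel_probs: dict voxel_index -> (C,) probabilities (the model);
           seg_probs: (C, H, W) output of the segmentation network."""
        C, H, W = seg_probs.shape
        fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
        for v in range(H):
            for u in range(W):
                z = depth[v, u]
                if z <= 0:
                    continue
                # Back-project the pixel into world coordinates at the new pose.
                p_cam = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
                p_world = (T_cam_to_world @ p_cam)[:3]
                idx = tuple(np.floor((p_world - grid_origin) / voxel_size).astype(int))
                # Overwrite (or blend) the voxel's class probabilities.
                voxel_probs[idx] = seg_probs[:, v, u]
        return voxel_probs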
[0053] Regarding Figures 5A and 5B, a process 500 is provided for a recurrent semantic segmentation method and system for image processing. In the illustrated implementation, process 500 may include one or more operations, functions or actions 502 to 538 numbered with even numbers. As a non-limiting example, process 500 can be described here with reference to the example image processing system 600 of Figure 6 or system 900 of Figure 9, where relevant.
[0054] Regarding Figures 6 and 8 in particular, process 500 can be operated by a semantic segmentation system 600 that has units that can be associated with three main stages or operations: geometric segmentation 602, semantic segmentation 604 and semantic update of the 3D semantic model 605. As mentioned above, the system 600 can combine geometric segmentation 602, for example by an RGBD-SLAM algorithm as described above, which provides the basis for a 3D semantic model 616. The semantic model 616 is updated or reconstructed using semantic segmentation 604.
[0055] For the first stage of geometric segmentation 602, the system 600 can have a new pose estimate unit 610 that receives a current image 608 in pose k and a rendered image 606 in pose k-1. The rendered image 606 is generated by ray casting from a reconstructed 3D geometric model 612. A geometric model update unit 614 then uses the new pose estimate to update the 3D geometry of the geometric model 612. Further details are provided below with the description of process 500.
[0056] Regarding the semantic segmentation stage 604, the system 600 has a semantic segmentation map unit 620 that forms a segmentation map 622 in an image plane in the same pose as the current frame being analyzed, projected from the 3D semantic model 616. The projected or rendered semantic segmentation map 622, along with the RGB image data of the current frame 618 in pose k, is provided to a semantic frame segmentation unit 624 to form a semantically segmented frame 626. The semantic frame segmentation unit 624 uses both inputs so as to use historical data to improve the accuracy of the semantic tags, and an example of the semantic frame segmentation unit 624 is provided by Figure 8. The semantic frame segmentation unit 624 or 800 may have a historical semantic high-level intermediate-value feature extraction unit (or historical extraction unit) 808 that performs high-level feature extraction on the segmentation map 804 (or 622). The historical extraction unit 808 can include a preprocessing unit 810, a convolution unit 812 and/or a ResNet unit 814, which are described below. In reality, these units may refer to certain functions and may not be separate chronological steps as described below.
[0057] Likewise, a current semantic high-level intermediate-value feature extraction unit (or current extraction unit) 806 can perform feature extraction on the current RGB frame k (or current frame in pose k). The current extraction unit 806 can have a current extraction neural network analysis unit 816 to perform the extraction with one or more neural network layers. For both the current semantic extraction and the historical semantic extraction, features in the form of feature vectors (or matrices or tensors) can be extracted and supplied to a current-historical combination unit 820. This unit can perform the concatenation of the current and historical extracted feature vectors as described below. The concatenated, or combined, feature vectors are then used as input to a segmentation output unit 822 that has a last or another segmentation neural network. According to one form, the concatenated features are in the form of tensors, and the features are supplied to the last or other segmentation neural network together, one matrix of the tensor at a time. The segmentation output unit 822 produces semantic tags, or classes, or probabilities for the tags or classes, and provides them as part of the current-historical (C-H) semantically segmented (or simply segmented) frame 824.
[0058] Returning to the system view 600, a semantic segmentation update unit 628 in update stage 605 can receive the segmented frame 626 or 824 and project the semantic data from the segmentation frame onto an image plane at the new pose estimate from the new pose estimate unit 610 of geometric stage 602. The update unit 628 can then register the indicated semantic classes or tags of the segmented frame 626 at the new pose estimate into the 3D segmentation model 616. Again, further details are provided below.
[0059] Now, for the explanation of process 500, this method may include obtaining image data of the current frame of the video sequence 502 and, as mentioned above, may include obtaining raw RGB data and pre-processing the image data sufficiently for geometric and semantic segmentation, as well as other applications. The frame to be analyzed is considered the current frame, and each frame in a video sequence being analyzed can be considered the current frame. It will be understood that segmentation can be applied to every frame in a video sequence or, alternatively, at some frame interval, such as every 5th or 10th frame, or any interval that has been shown to reduce computational loads while still maintaining semantic segmentation that is timely and accurate enough for a video stream. This may depend on the application using the segmentation.
[0060] Process 500 may include generating depth map 504, where a depth map for the current image can be formed to establish a 3D space for the video sequence being analyzed, and eventually used to generate a 3D geometric model. This may involve generating stereo depth or generating single camera depth.
[0061] Process 500 may include generating/updating the 3D geometric model 506 and, also as mentioned above, the 3D geometric model, which can be built by performing RGBD-SLAM 508 or other methods, can initially be formed from one or more depth maps. Accordingly, RGBD-SLAM methods such as those disclosed in Newcombe et al., KinectFusion: Real-time dense surface mapping and tracking, ISMAR (pp. 127 to 136), IEEE Computer Society (2011), can be used. Then, the 3D geometric model can be updated with each of the current frames being analyzed. It should be noted that the system can update previously known portions of the geometric model with each current frame being analyzed. Thus, in the RGBD-SLAM here, each frame not only adds new portions, but also refines and improves existing areas of the geometric model that are seen in the frame.
[0062] Process 500 may include rendering image data from the 3D geometric model in pose K-1 510, and this may include ray casting of the 3D geometric model onto an image plane formed when the camera is in pose K-1 (in this case, K and k refer to the same thing and are used interchangeably).
[0063] Process 500 may include determining a new pose estimate using pose K-1 and the current frame in pose K 512; in other words, the image data of the current frame captured when the camera is in pose K and the image data of the frame captured when the camera was in pose K-1 are used by a new pose estimate unit to determine the new pose estimate. This can be done by iterative closest point (ICP) algorithms comparing the current image in pose k and an earlier image in pose k-1.
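As an illustrative sketch only, the composition of the ICP result with the previously known camera pose could be expressed as follows, assuming 4x4 homogeneous camera-to-world matrices (a convention not specified in this disclosure):

    # Hedged sketch of operation 512: the rigid transform (R, t) returned by an
    # ICP registration between the rendered frame at pose k-1 and the current
    # frame is composed with the previous camera pose to obtain the new pose.
    import numpy as np

    def compose_new_pose(pose_k_minus_1, R_icp, t_icp):
        """pose_k_minus_1: 4x4 camera-to-world matrix of the previous frame.
           R_icp (3x3), t_icp (3,): incremental rotation/translation from ICP."""
        delta = np.eye(4)
        delta[:3, :3] = R_icp
        delta[:3, 3] = t_icp
        # Apply the incremental transform to the previously known camera position.
        return pose_k_minus_1 @ delta   # new pose estimate (pose k)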
[0064] Process 500 may include updating the 3D semantic model geometry with the geometric model data 514. In this case, the 3D semantic model geometry can be constructed by forming the voxel or 3D mesh structure using the 3D vertex structure of the 3D geometric model to arrange the voxels and/or the 3D mesh of the semantic model. The semantic tags are then added to the 3D semantic model as generated. According to an example, the 3D semantic model is in the form of a voxel grid or a 3D mesh with per-vertex semantic information. Likewise, other methods are contemplated.
[0065] With respect to Figure 7, process 500 may include rendering a segmentation map from the 3D semantic model 516, and this may involve obtaining the pose k of the current frame being analyzed, and then projecting the 3D semantic model onto an image plane formed by a camera in pose k. An example segmentation map 700 is provided in the current pose (or pose k) of the current frame, where the walls 702, chairs 704 and the floor 706 shown on map 700 are segmented from each other and each has a historically based semantic tag (wall, chair, floor, for example), that is, influenced by or based on information from previous frames. In some cases, objects that are adjacent to each other and have the same label may not appear as separate components on the segmentation map.
[0066] Process 500 may include performing semantic segmentation 518. According to the examples here, the proposed method is not limited to any specific type of CNN architecture. In a specific implementation, which was used in experiments, a pyramid scene parsing network (PSPNet) can be used, as described by Zhao et al., as the base semantic segmentation network architecture. See Zhao et al., Pyramid Scene Parsing Network, Computer Vision and Pattern Recognition (CVPR) (2017).
[0067] Likewise, recurrent segmentation can be performed quickly, even though the rendered semantic maps are not generated in advance. The neural networks used for extraction and segmentation can be trained offline before being provided for actual execution. The training is explained below. In practice, the system can perform segmentation on every nth frame (typically n = 10), instead of performing it on every frame. There are two reasons for this: (i) the system can use a slower segmentation architecture while the RGBD-SLAM algorithm continues to track the camera pose and update the model's geometry; and (ii) the computational cost of the algorithm is reduced, which improves the system's power efficiency.
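The scheduling described above can be summarized by the following illustrative Python sketch, in which the named callables merely stand in for the SLAM tracking, map rendering, segmentation and model update units and are assumptions rather than defined interfaces:

    # Hedged sketch: SLAM tracking runs every frame; the recurrent semantic
    # segmentation and model update run only every nth frame.
    def run_incremental_pipeline(frames, slam_track, render_map, segment,
                                 update_model, semantic_model, n=10):
        """frames: iterable of (rgb, depth) pairs; the callables are placeholders."""
        for k, (rgb, depth) in enumerate(frames):
            pose_k = slam_track(rgb, depth)          # pose and geometry, every frame
            if k % n == 0:                           # semantics only every nth frame
                seg_map = render_map(semantic_model, pose_k)
                seg_frame = segment(rgb, depth, seg_map)
                update_model(semantic_model, seg_frame, pose_k)
        return semantic_model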
[0068] As part of the feature extraction, process 500 may include extracting historical semantic high-level intermediate-value features 520. In this case, the feature extraction is carried out on the semantic segmentation map, which already has semantic labels from the 3D semantic model and reflects the previous semantic labeling of the 3D semantic model. This extraction can be performed separately from the feature extraction of the current RGBD image. Specifically, this may first involve a preprocessing operation by the preprocessing unit 810 (Figure 8) that converts the rendered semantic map input into the format expected for introduction into a neural network. According to an example, each pixel can be represented by the most likely top semantic tags. Thus, according to one form, the rendered semantic segmentation map frame can be provided in a structure of size W * H * 3, where the three top semantic classes are provided for each pixel, and the three top classes are in descending order of probability, for example. The representation for a neural network can thus be a W * H * C structure (where W corresponds to width, H to height and C to the number of classes) or, in other words, a tensor, so that each pixel in the image is represented by a C-sized vector of different likely semantic classes. Each of the C entries represents a class, the order can be constant (for example, 1-chair, 2-table, 3-floor, etc.), and all entry values are 0 except the three entries that represent the 3 top classes on the rendered map. For example, the entry that represents the best class may have a probability value of 1/2, the next may correspond to 1/3 and the next to 1/6.
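For illustration only, this preprocessing could be sketched in Python as follows, expanding a per-pixel top-3 class map into an H x W x C tensor whose non-zero entries carry the example weights 1/2, 1/3 and 1/6; the array layout is an assumption:

    # Hedged sketch: expand the rendered map's per-pixel top-3 classes into a
    # C-channel tensor with fixed descending weights 1/2, 1/3, 1/6.
    import numpy as np

    def expand_top3_map(top3_classes, num_classes):
        """top3_classes: (H, W, 3) class ids in descending order of probability."""
        H, W, _ = top3_classes.shape
        weights = (1/2, 1/3, 1/6)                    # descending probability values
        out = np.zeros((H, W, num_classes), dtype=np.float32)
        rows, cols = np.indices((H, W))
        for rank, w in enumerate(weights):
            cls = top3_classes[:, :, rank]
            out[rows, cols, cls] = w                 # one weight per ranked class
        return out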
[0069] For this operation, process 500 may include operating at least one neural network layer on the current-pose historical segmentation map data 522. Thus, the input mentioned above can be delivered for feature extraction as input into an extraction neural network that has at least one convolution layer and/or one or more ResNet layers for propagation. As performed in an experiment, the system can apply separate convolutions and a ResNet building block with batch normalization to the semantic segmentation map input. See He et al., Deep residual learning for image recognition, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016). As mentioned, the output can correspond to pixel-, voxel- or segment-level high-level features (or intermediate semantic values) in the form of tensors or modified feature vectors. These feature vectors can be provided for each pixel, or individual pixels (or pixel locations), throughout the rendered segmentation map being analyzed. According to one form, the extraction neural network can receive one W x H matrix of the tensor at a time, thus providing one class value of the pixels at a time.
[0070] Likewise, process 500 may include extracting current semantic high-level intermediate-value features 524, and this may also involve operating at least one neural network layer on the current frame data 526. Thus, this extraction also uses an extraction neural network, such as a CNN, also with one or more layers. In this case, however, the input into the current extraction neural network corresponds to image data of the current image, which can be RGB data and/or YUV luminance data, depth data or any combination of these, etc. The results can also be features or feature vectors placed in tensors to be concatenated with the historical features and then used as input to the segmentation neural network.
[0071] Process 500 may include combining both historical and current features 528, and this can be provided by the current-historical combination unit 820 according to one example. In this case, the system concatenates or combines features that were extracted from the current frame with features that were extracted from the historically influenced semantic segmentation map. Specifically, the results of both passes or branches can be tensors (three-dimensional matrices), but they can also be considered as vectors (matrices and tensors are simply generalizations of a vector). To provide a simplified explanatory example, assume that a result of the current frame extraction corresponds to a 3x1 feature vector (111, 222, 333) and a result of the segmentation map feature extraction corresponds to a 3x1 feature vector (444, 555, 666). In the concatenation stage, the two feature vectors can be concatenated into a single 6x1 feature vector (111, 222, 333, 444, 555, 666). In this case, two output tensors PxMxN and QxMxN are concatenated, and the result is a third tensor (P + Q)xMxN. The second and third dimensions remain the same while the first dimension is the sum of the sizes of both tensors. This resulting tensor can be provided as the input into a last or another neural network with one or more CNN layers (which can be referred to as a separate segmentation neural network or CNN), which produces semantic segmentation classes, or probabilities thereof, per pixel that are used to update the 3D semantic model as detailed below.
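A short numeric illustration of this concatenation (arbitrary values and shapes) is given below; the two feature tensors are joined along the first (feature) dimension while the spatial dimensions are unchanged:

    # Hedged sketch: concatenate a PxMxN and a QxMxN feature tensor into a
    # (P+Q)xMxN tensor along the feature dimension.
    import numpy as np

    current_feats    = np.random.rand(3, 4, 5)      # P x M x N, from the current frame
    historical_feats = np.random.rand(2, 4, 5)      # Q x M x N, from the rendered map
    fused = np.concatenate([current_feats, historical_feats], axis=0)
    assert fused.shape == (5, 4, 5)                  # (P + Q) x M x N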
[0072] The resulting concatenations or combinations correspond to features of the same location in the current frame and on the rendered segmentation map. Thus, the result of the concatenation is a long vector, where each vector represents an area in the image and corresponds to the high-level features of that given area in the image.
[0073] Process 500 may then include introducing the combinations as input to the semantic segmentation classification neural network 530, and this may involve applying at least one convolution layer to the concatenated data in a last or another semantic segmentation neural network before updating the 3D semantic model, and can be provided by the segmentation output unit 822 (Figure 8). According to one example, a concatenated vector, tensor or matrix corresponds to the input into the last neural network layer(s), and the output corresponds to the semantic class or label probabilities for each segment, or individual segments (or voxels or pixels), in the segmentation frame. It will be understood that each voxel can have multiple tags, each with a probability, and they can cover all possible tags or classes. According to one form, however, the semantic tags with the highest probabilities (such as three) can be kept for each voxel as explained here.
[0074] Process 500 may include registering the semantic class outputs to update the 3D semantic model 532, and this is done by placing the semantic tags output by the last neural network layer in the corresponding segments or voxels of the 3D semantic segmentation model. In order to conserve memory, the top X candidate semantic tags can be stored in each voxel. The 3D semantic model consists of a truncated signed distance function for each voxel from the RGBD-SLAM algorithm and semantic data in the form of the top X semantic classes. In the experiment referred to here, X = 3.
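As a hedged illustration only, a per-voxel record combining the truncated signed distance value with the top X = 3 candidate classes could be sketched as follows; the field layout and update policy are assumptions and not part of this disclosure:

    # Hedged sketch of a per-voxel record: TSDF value plus the top X candidate
    # semantic classes with their probabilities.
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class SemanticVoxel:
        tsdf: float = 1.0                 # truncated signed distance value
        weight: float = 0.0               # integration weight from RGBD-SLAM
        # (class_id, probability) pairs, kept sorted by descending probability.
        top_classes: List[Tuple[int, float]] = field(default_factory=list)

        def update_semantics(self, class_id: int, prob: float, x: int = 3):
            candidates = [c for c in self.top_classes if c[0] != class_id]
            candidates.append((class_id, prob))
            candidates.sort(key=lambda c: c[1], reverse=True)
            self.top_classes = candidates[:x]     # keep only the X best classes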
[0075] It will be understood that the method disclosed here is carried out incrementally over time, since new data, and in turn new geographic areas, are added to the 3D semantic model with each analysis of another frame. In contrast to Finman mentioned above, however, the new process also updates areas already existing in the 3D semantic model using the historical data from the rendered segmentation map.
[0076] Process 500 may include providing the 3D semantic model to applications that use the semantics 534 and, as mentioned, these may be computer vision applications, VR, AR or MR headset applications or any other application that may use the 3D semantic model.
[0077] Process 500 may include querying whether the current frame is the last frame in the sequence 536 and, if so, the process ends. Otherwise, process 500 may include obtaining the next current frame 538 when the video sequence has not yet ended. In this case, the image data of the new current frame is obtained and the process is repeated again from operation 504 to generate the depth map of the next current frame.
Training procedure [0078] Training of the aforementioned architecture in a supervised learning setting may use an RGBD video sequence training set, where the frames in each sequence carry semantic annotations. Such video can be obtained by (i) a labor-intensive method of manually segmenting each frame, (ii) segmenting a reconstructed 3D model, or (iii) using synthetic data. See Dai et al., Richly-annotated 3D Reconstructions of Indoor Scenes, Computer Vision and Pattern Recognition (CVPR) (2017).
[0079] In addition, training a recurring network requires semantic maps rendered from the 3D semantic model. Training can be carried out in several operations.
[0080] The first training operation may involve initialization by training a standard semantic segmentation network. First, a standard single-frame CNN-based semantic segmentation algorithm is trained. This resulting initial network can be denoted as ni, for example.
[0081] The next training operation may involve data preparation, which refers to the generation of training data for the recurring architecture. Given the current network, training data is generated for the next recurring phase in the form of triplets (RGBD frame, semantic map rendered from the 3D semantic model, ground-truth semantic segmentation) as follows. The system is executed as shown in Figures 6 and 8 with the current network on short sequences of N frames, where N is an adjustable parameter. A corresponding semantic map is rendered for the last frame from the last camera pose in each sequence, and then saved with the frame as training data for the next stage. The semantic map is represented as an image of H * W pixels (the frame size) with C channels (the number of classes that the system supports). Since only X < C probabilities are remembered in each voxel, the lowest C - X probabilities are truncated to zero, and the remaining X probabilities are renormalized to correspond to a proper distribution.
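As a non-limiting sketch of the truncation and renormalization just described, the following Python code processes an H x W x C rendered map; the array layout and the in-memory representation are assumptions made only for illustration.

    import numpy as np

    def truncate_and_renormalize(semantic_map, x=3):
        # semantic_map: H x W x C array of per-pixel class probabilities.
        h, w, c = semantic_map.shape
        flat = semantic_map.reshape(-1, c).astype(np.float64)
        # Zero the (C - X) smallest probabilities of each pixel ...
        drop = np.argsort(flat, axis=1)[:, : c - x]
        np.put_along_axis(flat, drop, 0.0, axis=1)
        # ... and renormalize the remaining X values so each pixel sums to one.
        flat /= flat.sum(axis=1, keepdims=True)
        return flat.reshape(h, w, c)

    # Example on a tiny 1 x 1 map with C = 4 classes and X = 3 kept classes:
    print(truncate_and_renormalize(np.array([[[0.1, 0.4, 0.3, 0.2]]]), x=3))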
[0082] Another training operation then involves training the recurring architecture. Given the new training data (RGBD frame, semantic map rendered from the model from the last camera pose, and ground-truth semantic segmentation), training (or fine-tuning) is continued on the previous network with the additional branch, as described previously, and operated as in Figure 8. The training and data preparation of the recurring architecture can be repeated for several iterations. The final network can be called nfinal.
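By way of illustration only, one fine-tuning step of such a recurring architecture on a single (frame, rendered map, ground truth) triplet might look like the following Python sketch; the loss, optimizer and tensor shapes are assumptions rather than the exact training recipe of this disclosure.

    import torch.nn.functional as F

    def training_step(model, optimizer, frame, rendered_map, ground_truth):
        # frame, rendered_map: input tensors for the two branches (batch first).
        # ground_truth: B x H x W tensor of integer class labels per pixel.
        optimizer.zero_grad()
        logits = model(frame, rendered_map)           # B x C x H x W class scores
        loss = F.cross_entropy(logits, ground_truth)  # per-pixel classification loss
        loss.backward()
        optimizer.step()
        return loss.item()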
[0083] In addition, any one or more of the operations in Figures 4, 5A and 5B may be undertaken in response to instructions provided by one or more computer program products. Such program products may include signal-bearing media providing instructions that, when executed by, for example, a processor, may provide the functionality described herein. The computer program products may be provided in any form of one or more machine-readable media. Thus, for example, a processor including one or more processor cores may undertake one or more of the operations of the example processes herein in response to program code and/or instructions or instruction sets conveyed to the processor by one or more computer- or machine-readable media. In general, a machine-readable medium may convey software in the form of program code and/or instructions or instruction sets that may cause any of the devices and/or systems to perform as described herein. The machine- or computer-readable media may be a non-transitory article or medium, such as a non-transitory computer-readable medium, and may be used with any of the examples mentioned above or other examples, except that they do not include a transitory signal per se. They do include elements other than a signal per se that may hold data temporarily in a transitory fashion, such as RAM and so forth.
[0084] As used in any implementation described here, the
term module refers to any combination of software logic, firmware logic and/or hardware logic configured to provide the functionality described here. The software may be embodied as a software package, code and/or instruction set or instructions, and hardware, as used in any implementation described herein, may include, for example, singly or in any combination, hardwired circuitry, programmable circuitry, state machine circuitry and/or fixed-function firmware that stores instructions executed by programmable circuitry. The modules may, collectively or individually, be embodied as circuitry that forms part of a larger system, for example, an integrated circuit (IC), a system on a chip (SoC), and so forth. For example, a module may be embodied in logic circuitry for implementation via software, firmware or hardware of the coding systems discussed here.
[0085] As used in any implementation described here, the term logical unit refers to any combination of firmware logic and / or hardware logic configured to provide the functionality described here. The logic units can, collectively or individually, be incorporated as a set of circuits that are part of a larger system, for example, an integrated circuit (CI), system on a chip (SoC), etc. For example, a logic unit can be contained in the logic circuitry for implementation via firmware or hardware of the coding systems discussed here. One skilled in the art will recognize that operations performed by fixed-function hardware and / or firmware can alternatively be implemented via software, which can be contained as a software package, set of codes and / or instructions or instructions, and also recognize that the logic unit can also use
a piece of software to implement its functionality.
[0086] As used in any implementation described here, the term component can refer to a module or a logical unit, since these terms are described above. Accordingly, the term component can refer to any combination of software logic, firmware logic and / or hardware logic configured to provide the functionality described herein. For example, one skilled in the art will recognize that operations performed by hardware and / or firmware can alternatively be implemented via a software module, which can be contained as a software package, set of codes and / or instructions, and also recognize that a logical unit can also use a piece of software to implement its functionality.
[0087] With respect to Figure 9, an example image processing system 900 is arranged according to at least some implementations of the present disclosure. In various implementations, the example image processing system 900 may have an imaging device 902 for forming or receiving captured image data. This can be implemented in several ways. Thus, in one way, the 900 image processing system can correspond to one or more digital cameras or other image capture devices, and the 902 image device, in this case, can be the camera hardware and sensor software camera, module or component. In other examples, the imaging system 900 may have an imaging device 902 that includes or may be one or more cameras, and logic modules 904 may communicate remotely with, or may be communicatively coupled to, the imaging device 902 for further processing
of the image data.
[0088] Thus, the image processing system 900 can be a single camera alone or can be found in a multi-camera device, which can be a smartphone, tablet, laptop or other mobile device, but particularly here it can correspond to computer vision cameras and sensors, and/or VR, AR or MR headsets, glasses or other head-mounted accessory positioned over a person's eyes. Otherwise, the system 900 can be the device with multiple cameras where processing takes place in one of the cameras or in a separate processing location communicating with the cameras either inside or outside the device, and whether or not the processing is performed on a mobile device.
[0089] In any of these cases, this technology can include a camera, such as a digital camera system, a specialized camera device, or a tablet or imaging phone, or another video camera, camera, including a headset that receives a smartphone for example, or some combination of them. Thus, in one form, the 902 imaging device may include camera and optics hardware including one or more sensors, as well as auto focus, zoom, aperture, ND filter, auto exposure, flash and trigger controls. These controls can be part of a sensor module or component for operating the sensor that can be used to generate images for a viewfinder and capture still images or video. The 902 imaging device may also have a lens, an image sensor with a Bayer RGB color filter, an analog amplifier, an A / D converter, other components for converting incident light into a digital signal, the like, and / or combinations of the same. The digital signal can also be referred to here as the raw image data.
[0090] Other forms include a camera sensor-type imaging device or the like (for example, a webcam or webcam sensor or other complementary metal oxide semiconductor (CMOS) image sensor) in addition to, or instead of, using a red-green-blue (RGB) depth camera and/or microphone array to locate the speaker. The camera sensor can also support other types of electronic shutters, such as global shutter in addition to, or instead of, rolling shutter, and many other types of shutter. In other examples, an RGB depth camera and/or microphone array may be used as an alternative to a camera sensor. In these examples, in addition to a camera sensor, the same sensor or a separate sensor can also be provided as a light projector, such as an IR projector, to provide a separate depth image that can be used for triangulation with the camera image. Otherwise, the imaging device may have any other known technology for providing depth maps using multiple camera or imaging devices, or a single imaging device.
[0091] In the example illustrated and relevant here, logic modules 904 may include a raw image processing unit 906 that performs pre-processing on the image data sufficient for segmentation, but which may also be sufficient for generating a map of depth or depth image, a 908 depth map generation unit that performs depth algorithms typically on multiple images of the same scene, and to form a three-dimensional space where pixels or points have three-dimensional coordinates (x, y, z) in a depth map or resulting depth image that represents the
three-dimensional space (or 2D image or set of images from the same scene).
[0092] The logic modules may also have an image segmentation unit 910 to perform many of the operations already described here. Thus, for example, the segmentation unit 910 may have a geometric segmentation unit 912 that forms the new pose estimates and maintains the 3D geometric model as described above. A semantic frame segmentation unit 940 can be provided to perform the recurring semantic segmentation as described above. To accomplish these tasks, the geometric segmentation unit 912 may have a pose K - 1 rendering unit 914, a new pose estimation unit 916 and a geometric model update unit 918, each as described for the similarly named units above or performing readily recognized tasks as described above. Similarly, the semantic frame segmentation unit 940 may have a semantic segmentation map unit 942, a historical semantic feature extraction unit 944, a current feature extraction unit 946, a current-historical combination unit 948 and a segmentation output unit 950, performing tasks as described above. A semantic segmentation update unit 952 is provided to update the 3D semantic model with the semantic output of the segmentation output unit 950, also as described above.
[0093] The image processing system 900 may have one or more processors 920, which may include a dedicated image signal processor (ISP) 922 such as the Intel Atom, one or more memory stores 924, one or more displays 928 to provide images 930, an encoder
932 and antenna 926. In an example implementation, the image processing system 900 may have display 928, at least one processor 920 communicatively coupled to the display and at least one memory 924 communicatively coupled to the processor. The 932 encoder can be an encoder, decoder or both. As an encoder 932, and with antenna 934, the encoder can be provided to compress image data for transmission to other devices that can display or store the image. It will be understood that, as a decoder, the encoder can receive and decode image data for processing by the 900 system to receive images for segmentation in addition to, or instead of, initially capturing the images with the 900 device. Otherwise, the processed image 930 can be displayed on display 928 or stored in memory 924. As illustrated, any one of these components may be capable of communicating with each other and / or communicating with portions of logic modules 904 and / or a 902 imaging device. Processors 920 can be communicatively coupled to both the imaging device 902 and logic modules 904 to operate these components. According to one approach, although the image processing system 900, as shown in Figure 9, can include a particular set of blocks or actions associated with particular components or modules, those blocks or those actions can be associated with different components or modules in comparison with the particular component or module illustrated here. [0094] With reference to Figure 10, an example system 1000 in accordance with the present disclosure operates one or more aspects of the image processing system described herein. It will be understood from the nature of the system components described below that these components can be associated with, or used to operate,
certain part or parts of the image processing system 1000 described above and, consequently, used to operate the methods described herein. In various implementations, system 1000 can be a media system, although system 1000 is not limited to this context. For example, system 1000 can be incorporated into a digital photo camera, a digital video camera, a mobile device with camera or video functions such as an imaging phone, a webcam, a personal computer (PC), a laptop computer, an ultra-laptop computer, a tablet with multiple cameras, a touchpad, a portable computer, a handheld computer, a palmtop computer, a Personal Digital Assistant (PDA), a cell phone, a combination cell phone/PDA, a television, a smart device (e.g. smartphone, smart tablet or smart television), a mobile Internet device (MID), a messaging device, a data communication device, etc.
[0095] In several implementations, the system 1000 includes a platform 1002 coupled to a display 1020. Platform 1002 can receive content from a content device, such as content service device (s) 1030 or device (s) from 1040 content distribution or other similar content sources. A navigation controller 1050 including one or more navigation features can be used to interact with, for example, platform 1002 and / or display 1020. Each of these components is described in more detail below.
[0096] In various implementations, platform 1002 can include any combination of a 1005 chipset, a 1010 processor, a 1012 memory, a 1014 storage, a 1015 graphics subsystem, 1016 applications and / or a 1018 radio. The 1005 chipset can
provide intercommunication between the processor 1010, memory 1012, storage 1014, graphics subsystem 1015, applications 1016 and/or radio 1018. For example, the chipset 1005 may include a storage adapter (not shown) capable of providing intercommunication with the storage 1014.
[0097] Processor 1010 can be implemented as a Complex Instruction Set Computer (CISC) or Reduced Instruction Set Computer (RISC) processor; x86 instruction set compatible processors, multi-core, or any other microprocessor or central processing unit (CPU). In several implementations, the processor 1010 can correspond to dual-core processor(s), mobile dual-core processor(s), etc.
[0098] The memory 1012 can be implemented as a volatile memory device, such as, but not limited to, a Random Access Memory (RAM), Dynamic Random Access Memory (DRAM) or Static RAM (SRAM).
[0099] Storage 1014 can be implemented as a non-volatile storage device, such as, but not limited to, a magnetic disk drive, an optical disk drive, a tape drive, an internal storage device, an attached storage device, flash memory, battery backed-up SDRAM (synchronous DRAM) and/or a network accessible storage device. In various implementations, storage 1014 can include technology to increase the storage performance enhanced protection for valuable digital media when multiple hard drives are included, for example.
[0100] The graphics subsystem 1015 can perform image processing, such as static or video for presentation. The graphics subsystem 1015 can be a graphics processing unit (GPU - Graphics Processing Unit) or a visual processing unit (VPU - Visual Processing Unit), for example. An analog or digital interface can be used to connect the 1015 graphics subsystem and the 1020 display communicatively. For example, the interface can be any one between a High-Definition Multimedia Interface (HDMI), Display Port , Wireless HDMI and / or wireless HD compatible techniques. The graphics subsystem 1015 can be integrated into the processor 1010 or chipset 1005. In some implementations, the graphics subsystem 1015 can be a standalone card communicatively coupled to the chipset 1005.
[0101] The graphics and / or video processing techniques described here can be implemented in various hardware architectures. For example, graphics and / or video functionality can be integrated into a chipset. Alternatively, a separate graphics and / or video processor can be used. As yet another implementation, graphics and / or video functions can be provided by a general-purpose processor, including a multi-core processor. In other implementations, the functions can be implemented in a consumer electronics device.
[0102] The 1018 radio may include one or more radios capable of transmitting and receiving signals using various suitable wireless communication techniques. These techniques can involve communications over one or more wireless networks. Sample wireless networks include (but are not limited to) Wireless Local Area Networks (WLANs), personal area networks without
wires (WPANs - Wireless Personal Area Networks), wireless metropolitan area networks (WMANs - Wireless Metropolitan Area Networks), cellular networks and satellite networks. When communicating on these networks, the radio 1018 can operate according to one or more applicable standards in any version.
[0103] In several implementations, the 1020 display can include any television-type display or monitor. The display 1020 may include, for example, a computer display screen, a touch screen display, a video monitor, a television-type device and / or a television. The 1020 display can be digital and / or analog. In several implementations, the 1020 display can be a holographic display. Likewise, the display 1020 can be a transparent surface that can receive a visual projection. These projections can lead to various forms of information, images and / or objects. For example, these projections can be a visual overlay for a mobile augmented reality (MAR - Mobile Augmented Reality) application. Under the control of one or more 1016 software applications, platform 1002 can display user interface 1022 on display 1020.
[0104] In several implementations, the 1030 content services device (s) can be hosted by any national, international and / or independent service and, therefore, accessible to the 1002 platform via the Internet , for example. Content service device (s) 1030 may be coupled to platform 1002 and / or display 1020. Platform 1002 and / or content service device (s) 1030 may be coupled to a 1060 network to communicate (for example, send and / or receive) multimedia information to, and from, the 1060 network. The content delivery device (s) 1040 can also be coupled to platform 1002 and / or display 1020.
[0105] In various implementations, the content services device(s) 1030 may include a cable television box, a personal computer, a network, a telephone, an Internet-enabled device or appliance capable of delivering digital information and/or content, and any other similar device capable of communicating content unidirectionally or bidirectionally between content providers and platform 1002 and display 1020, via network 1060 or directly. It will be recognized that content can be communicated unidirectionally and/or bidirectionally to and from any of the components in system 1000 and a content provider via network 1060. Examples of content can include any multimedia information including, for example, video, music, medical and gaming information, etc.
[0106] The 1030 content services device (s) can receive content, such as cable television programming including multimedia information, digital information and / or other content. Examples of content providers can include any radio or satellite or cable television or Internet content providers. The examples provided are not intended to limit implementations in any way in accordance with the present disclosure.
[0107] In various implementations, platform 1002 can receive control signals from the navigation controller 1050 having one or more particularities of navigation. The navigation particulars of the 1050 controller can be used to interact with the 1022 user interface, for example. In implementations, the 1050 navigation controller can be a pointing device that can be a component of computer hardware (specifically, a human interface device) that allows a user to enter spatial data (for example, continuous and
multidimensional) on a computer. Many systems, such as graphical user interfaces (GUI - Graphical User Interfaces), and televisions and monitors, allow the user to control and provide data to the computer or television using physical gestures. [0108] The movements of the navigation characteristics of the 1050 controller can be replicated on a display (for example, display 1020) by movements of a pointer, cursor, focus ring or other visual indicators shown on the display. For example, under the control of software applications 1016, the navigation features located on the navigation controller 1050 can be mapped to virtual navigation features presented in the 1022 user interface, for example. In implementations, controller 1050 may not be a separate component, but may be integrated into platform 1002 and / or display 1020. The present disclosure, however, is not limited to the elements or the context shown or described herein.
[0109] In various implementations, controllers (not shown) can include technology to allow users to instantly turn the platform 1002 on and off, like a television, at the touch of a button after initial startup, when enabled, for example. Program logic may allow platform 1002 to deliver content to multimedia adapters or other content service device(s) 1030 or content delivery device(s) 1040 even when the platform is turned off. In addition, the chipset 1005 may include hardware and/or software support for 8.1 surround sound audio and/or high definition (7.1) surround sound audio, for example. Controllers can include a graphics controller for integrated graphics platforms. In implementations, the graphics controller can comprise a Peripheral Component Interconnect (PCI) Express graphics card.
[0110] In various implementations, any one or more of the components shown in the 1000 system can be integrated. For example, platform 1002 and content service devices 1030 can be integrated, or platform 1002 and content delivery device (s) 1040 can be integrated, or platform 1002, o ( s) content service device (s) 1030 and content delivery device (s) 1040, for example. In various implementations, platform 1002 and display 1020 can be an integrated unit. The display 1020 and the content service device (s) 1030 can be integrated, or the display 1020 and the content delivery device (s) 1040 can be integrated, for example. These examples are not intended to limit the present disclosure.
[0111] In several implementations, the system 1000 can be implemented as a wireless system, a wired system or a combination of both. When implemented as a wireless system, system 1000 may include components and interfaces suitable for communication on shared wireless media, such as one or more antennas 1003, transmitters, receivers, transceivers, amplifiers, filters, control logic, etc. . An example of wireless shared media can include portions of a wireless spectrum, such as the RF spectrum, etc. When implemented as a wired system, system 1000 can include components and interfaces suitable for communication on wired media, such as input / output (I / O) adapters, physical connectors to connect the I / O adapter S with a corresponding wired communication medium, a Network Interface Card (NIC), disk controller,
video controller, audio controller and the like. Examples of wired communications media can include a wire, a cable, metal conductors, a printed circuit board (PCB), a backplane, a switch fabric, semiconductor material, twisted-pair wire, coaxial cable, optical fiber, etc.
[0112] Platform 1002 can establish one or more logical or physical channels for the communication of information. The information can include multimedia information and control information. Multimedia information can refer to any data representing content intended for a user. Examples of content may include, for example, data from a voice conversation, video conference, streaming video, email (email), text message (sending SMS), social media formats, message voice mail, alphanumeric symbols, graphics, image, video, text, etc. The data of a voice conversation can be, for example, voice information, periods of silence, background noise, comfort noise, tones, etc. Control information can refer to any data representing commands, instructions or control words intended for an automated system. For example, control information can be used to route multimedia information through a system, or instruct a node to process multimedia information in a predetermined manner. However, implementations are not limited to the elements or in the context shown or described in Figure 10.
[0113] Referring to Figure 11, a small form factor 1100 device is an example of the varying physical styles or form factors in which 900 or 1000 systems can be contained. According to this approach, the 1100 device can be implemented as a
mobile computing device with wireless capabilities. A mobile computing device can refer to any device having a processing system and a mobile power source or supply, such as one or more batteries, for example.
[0114] As described above, examples of a mobile computing device can include a digital photo camera, a digital video camera, mobile devices with camera or video functions, such as image phones, a webcam, a personal computer (PC), a laptop computer, an ultra-laptop computer, a tablet, a touchpad, a portable computer, a handheld computer, a palmtop computer, a personal digital assistant (PDA), a cell phone, a cell phone / PDA, television, smart device (e.g. smartphone, smart tablet or smart television), mobile Internet device (MID), messaging device, data communication device, etc.
[0115] Examples of a mobile computing device can also include computers that are arranged for use by a person, such as a wrist computer, finger computer, ring computer, eyeglass computer, belt loop computer , armband computer, shoe computers, clothing computers and other wearable computers. In various embodiments, for example, a mobile computing device can be implemented as a smartphone capable of running computer applications, as well as voice and / or data communications. Although some modalities can be described with a mobile computing device implemented as a smartphone as an example, it must be recognized that they can be implemented
in other embodiments using other wireless mobile computing devices as well. Implementations are not limited in this context.
[0116] As shown in Figure 11, device 1100 may include a housing with a front 1101 and a rear 1102. Device 1100 includes a display 1104, an input/output (I/O) device 1106 and an integrated antenna 1108. Device 1100 can also include navigation features 1112. The I/O device 1106 can include any I/O device suitable for entering information into a mobile computing device. Examples of the I/O device 1106 may include an alphanumeric keyboard, a numeric keypad, a touchpad, input keys, buttons, switches, microphones, speakers, voice recognition software and device, etc. Information can also be entered into the device 1100 by microphone 1114 or can be digitized by a voice recognition device. As shown, the device 1100 may include a camera 1105 (for example, including at least one lens, an aperture and an image sensor) and an illuminator 1110, such as those described here, integrated into the rear 1102 (or elsewhere) of device 1100. Implementations are not limited in this context.
[0117] Various forms of the devices and processes described here can be implemented using hardware elements, software elements or a combination of both. Examples of hardware elements can include processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, etc.), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate arrays (FPGA), logic gates, registers, semiconductor devices, chips, microchips, chipsets, etc. Examples of software can include software components, programs, applications, computer programs, application programs, system programs, machine programs, operating system software, middleware, firmware, software modules, routines, subroutines, functions, methods, procedures, software interfaces, application program interfaces (API), instruction sets, computer code, code segments, computer code segments, words, values, symbols or any combination of them. Determining whether an embodiment is implemented using hardware elements and/or software elements can vary according to any number of factors, such as the desired computational rate, power levels, heat tolerances, processing cycle budget, input data rates, output data rates, memory resources, data bus speeds and other design or performance constraints.
[0118] One or more aspects of at least one modality can be implemented by representative instructions stored in a machine-readable medium that represents various logics within the processor, which when read by a machine cause the machine to manufacture logic to perform the techniques described here. These representations, known as IP cores, can be stored in a machine-readable medium, tangible and provided to multiple customers or manufacturing facilities to load onto the manufacturing machines that actually create the logic or the
processor.
[0119] Although certain features presented here have been described with reference to various implementations, this description is not intended to be interpreted in a limiting sense. Therefore, several modifications to the implementations described here, as well as other implementations, which are evident to those skilled in the art to which the present disclosure belongs, are considered to be part of the spirit and scope of the present disclosure.
[0120] The following examples pertain to more implementations. [0121] According to an example implementation, a computer implemented method of semantic segmentation for image processing, comprising obtaining a video sequence of frames of image data and comprising a current frame; recurrently generating a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the semantic segmentation map of a 3D semantic segmentation model, in which the individual semantic segmentation maps are each of them, associated with a different current frame of the video sequence; extracting historically influenced semantic features from the semantic segmentation map; extracting current semantic particulars from the current frame; generating a current and historical semantically segmented frame comprising the use of both the current semantic particularities and the historically influenced semantic particularities as input into a neural network that indicates semantic tags for areas of the current and historical semantically segmented frame; and semantically updating the 3D semantic segmentation model including the use of the current and historical semantically segmented frame.
[0122] According to another implementation, this method may comprise the geometric updating of the 3D semantic segmentation model with data of individual current frames as the video sequence is being analyzed; and the rendering of an updated semantic segmentation map in a current pose of each current frame used to geometrically update the 3D semantic segmentation model, in which the extraction of historically influenced semantic particularities from the semantic segmentation map comprises the introduction of semantic segmentation tag data from the semantic segmentation map into an extraction neural network and the generation of the semantic particularities, the method comprising the placement of the semantic segmentation tag data in the form of tensors where one of the dimensions of the tensor corresponds to multiple probable semantic tags for a single pixel location, in which the neural network uses a convolutional neural network (CNN) that is a residual network (ResNet), in which the current semantic particularities and the historical semantic particularities are both in the form of tensors in which one of the dimensions of the tensor corresponds to multiple probable semantic tags for a single pixel location, and in which the generation of the current and historical semantically segmented frame comprises the concatenation of the current semantic particularities and the historically influenced semantic particularities to form an input particularity vector with both the current semantic particularities and the historically influenced semantic particularities, to be introduced into the neural network. The method comprising the matching of the image location of the current semantic particularities and of the
historically influenced semantic particularities, so that individual input particularity vectors represent a single area of the image, where the particularity vectors form part of the tensors, and the concatenation comprises concatenation of the tensors forming the input particularity vectors; the method comprising the introduction of one matrix at a time into the neural network from the concatenated tensors, and in which the 3D semantic segmentation model is geometrically updated using a red, green, blue depth scheme with simultaneous location and mapping (RGBD-SLAM), the method comprising determining a new pose estimate using both the current frame in a current pose and a rendered image from a previous pose, in which the rendered image is obtained by ray casting projection from a 3D geometric model separate from the 3D semantic segmentation model; providing the current and historical semantically segmented frame in the new pose estimate; and updating the 3D semantic segmentation model comprising the registration of semantic tags of the current and historical semantically segmented frame in the new pose estimate and in the 3D semantic segmentation model.
[0123] According to another implementation, a system implemented by a semantic segmentation computer for image processing, comprising at least one display; at least one memory; at least one processor connected communicatively to the display and memory; and a semantic segmentation unit operated by at least one processor and to operate: obtaining a video sequence of frames of image data and comprising a current frame; recurrently generating a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the
semantic segmentation map from a 3D semantic segmentation model, in which individual semantic segmentation maps are each associated with a different current frame of the video sequence; extracting historically influenced semantic features from the semantic segmentation map; extracting current semantic particularities from the current frame; generating a current and historical semantically segmented frame comprising the use of both the current semantic particularities and the historically influenced semantic particularities as input into a neural network that indicates semantic tags for areas of the current and historical semantically segmented frame; and semantically updating the 3D semantic segmentation model including the use of the current and historical semantically segmented frame.
[0124] The system may also include in which the semantic segmentation unit is intended to operate by geometrically updating the 3D semantic segmentation model with data from individual current frames as the video sequence is being analyzed, in which the segmentation unit semantics is intended to operate by rendering an updated semantic segmentation map in a current pose of individual current frames used to geometrically update the 3D segmentation model, in which the extraction of historically influenced semantic particularities from the semantic segmentation map comprises the introduction of data from semantic segmentation tag from the semantic segmentation map in a neural extraction network and the generation of semantic particularities, in which the semantic segmentation unit is intended to operate by placing the semantic segmentation tag data in the form of tensors where one of the dimensions do t ensor matches multiple tags
of probable semantic classes for a single pixel location, in which the neural network uses a convolutional neural network (CNN) that is a residual network (ResNet), and in which the generation of the current and historical semantically segmented frame comprises the concatenation of the current semantic particularities and the historically influenced semantic particularities to form an input particularity vector with both the current semantic particularities and the historically influenced semantic particularities, to be introduced into the neural network. [0125] As another implementation, at least one computer-readable medium having stored thereon instructions that, when executed, cause a computing device to operate by: obtaining a video sequence of frames of image data and comprising a current frame; recurrently generating a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the semantic segmentation map from a 3D semantic segmentation model, in which the individual semantic segmentation maps are, each of them, associated with a different current frame of the video sequence; extracting historically influenced semantic particularities from the semantic segmentation map; extracting current semantic particularities from the current frame; generating a current and historical semantically segmented frame comprising the use of both the current semantic particularities and the historically influenced semantic particularities as input into a neural network that indicates semantic tags for areas of the current and historical semantically segmented frame; and semantically updating the 3D semantic segmentation model including the use of the current and historical semantically segmented frame.
[0126] The instructions can also cause the computing device to operate such that the generation of the current and historical semantically segmented frame comprises the concatenation of the current semantic particularities and the historically influenced semantic particularities to form an input particularity vector with both the current semantic particularities and the historically influenced semantic particularities, to be introduced into the neural network, where the instructions cause the computing device to operate by matching the image location of the current semantic particularities and the historically influenced semantic particularities, so that individual input particularity vectors represent a single area of the image, where the particularity vectors form part of the tensors, and the concatenation comprises concatenation of the tensors forming the input particularity vectors; the method comprising the introduction of one matrix at a time into the neural network from the concatenated tensors, in which the 3D semantic segmentation model is geometrically updated using a red, green, blue depth scheme with simultaneous location and mapping (RGBD-SLAM), where the instructions cause the computing device to operate by determining a new pose estimate using both the current image in a current pose and a rendered image from a previous pose, in which the rendered image is obtained by ray casting projection from a 3D geometric model separate from the 3D semantic segmentation model; providing the current semantically segmented frame in the new pose estimate; and updating the 3D semantic model comprising the registration of semantic tags of the current semantically segmented frame in the new pose estimate and in the
3D semantic model.
[0127] In another example, at least one machine-readable medium may include a plurality of instructions that, in response to being executed on a computing device, cause the computing device to perform the method in accordance with any of the examples above.
[0128] In yet another example, an apparatus may include means for carrying out the methods according to any of the examples above.
[0129] The examples above may include a specific combination of features. However, the examples above are not limited in this regard and, in several implementations, the examples above may include realizing a subset of these particularities, realizing a different order of these particularities, realizing a different combination of these particularities and / or the realization of additional features in addition to the features explicitly listed. For example, all the features described with respect to any example methods here can be implemented with respect to any example apparatus, any example systems and / or any example articles, and vice versa.
Claims (25)
[1]
1. Computer implemented method of semantic segmentation for image processing characterized by the fact that it comprises:
obtaining a video sequence of frames of image data and comprising a current frame;
recurring generation of a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the semantic segmentation map from a 3D semantic segmentation model, in which individual semantic segmentation maps are, each of them, associated with a current frame different from the video sequence;
extraction of semantic particularities historically influenced from the semantic segmentation map;
extraction of current semantic particularities from the current frame;
generation of a current and historical semantically segmented frame comprising the use of both the current semantic particularities and the historically influenced semantic particularities as input into a neural network that indicates semantic tags for areas of the current and historical semantically segmented frame; and semantically updating the 3D semantic segmentation model including the use of the current and historical semantically segmented frame.
[2]
2. Method, according to claim 1, characterized by the fact that it comprises the geometric update of the 3D semantic segmentation model with data of individual current frames as the video sequence is
being analyzed.
[3]
3. Method, according to claim 1, characterized by the fact that it comprises the rendering of an updated semantic segmentation map in a current pose of each current frame used to geometrically update the 3D semantic segmentation model.
[4]
4. Method, according to claim 1, characterized by the fact that the extraction of historically influenced semantic particularities from the semantic segmentation map comprises the introduction of semantic segmentation tag data from the semantic segmentation map into a neural network of extraction and the generation of semantic particularities.
[5]
5. Method, according to claim 4, characterized by the fact that it comprises the placement of the semantic segmentation tag data in the form of tensors where one of the dimensions of the tensor corresponds to multiple probable semantic tags for a single pixel location.
[6]
6. Method, according to claim 4, characterized by the fact that the neural network uses a convolutional neural network (CNN) which is a residual network (ResNet).
[7]
7. Method, according to claim 1, characterized by the fact that the current semantic features and the historical semantic features are both in the form of tensors in which one of the dimensions of the tensor corresponds to multiple probable semantic labels for a single pixel location.
[8]
8. Method, according to claim 1, characterized by the fact that the generation of the current and historical semantically segmented frame comprises the concatenation of the current semantic particularities and the historically influenced semantic particularities to form an input particularity vector with both the current semantic particularities and the historically influenced semantic particularities, to be introduced into the neural network.
[9]
9. Method, according to claim 8, characterized by the fact that it comprises matching the image location of the current semantic particularities and of the historically influenced semantic particularities, so that individual input particularity vectors represent a single area of the image.
[10]
10. Method, according to claim 8, characterized by the fact that the particularity vectors constitute a part of the tensors, and the concatenation comprises concatenation of the tensors forming the input particularity vectors; the method comprising the introduction of one matrix at a time into the neural network from the concatenated tensors.
[11]
11. Method, according to claim 1, characterized by the fact that the 3D semantic segmentation model is geometrically updated using a red, green, blue depth scheme with simultaneous location and mapping (RGBD-SLAM).
[12]
12. Method, according to claim 11, characterized by the fact that it comprises the determination of a new pose estimate using both the current frame in a current pose and a rendered image from a previous pose, in which the rendered image is obtained by Ray Casting projection from a 3D geometric model separate from the 3D semantic segmentation model;
the provision of the current and historical semantically segmented frame in the new pose estimate; and the updating of the 3D semantic segmentation model including the registration of semantic tags of the current and historical semantically segmented frame in the new pose estimate and in the 3D semantic segmentation model.
[13]
13. Computer implemented system of semantic segmentation for image processing characterized by the fact that it comprises:
at least one display;
at least one memory;
at least one processor communicatively coupled to the display and memory; and a semantic segmentation unit operated by at least one processor and to operate:
obtaining a video sequence of frames of image data and comprising a current frame;
recurrently generating a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the semantic segmentation map of a 3D semantic segmentation model, in which individual semantic segmentation maps are each , associated with a current frame different from the video sequence;
extracting historically influenced semantic features from the semantic segmentation map;
extracting current semantic particulars from the current frame;
generating a current and historical semantically segmented frame comprising the use of both the current semantic particularities and the historically influenced semantic particularities as input into a neural network that indicates semantic tags for areas of the current and historical semantically segmented frame; and
semantically updating the 3D semantic segmentation model including the use of the current and historical semantically segmented frame.
[14]
14. System, according to claim 13, characterized by the fact that the semantic segmentation unit is intended to operate by geometrically updating the 3D semantic segmentation model with data from individual current frames as the video sequence is being analyzed.
[15]
15. System, according to claim 13, characterized by the fact that the semantic segmentation unit is intended to operate by rendering an updated semantic segmentation map in a current pose of individual current frames used to geometrically update the 3D segmentation model.
[16]
16. System, according to claim 13, characterized by the fact that the extraction of historically influenced semantic particularities from the semantic segmentation map comprises the introduction of semantic segmentation tag data from the semantic segmentation map into a neural network of extraction and the generation of semantic particularities.
[17]
17. System, according to claim 16, characterized by the fact that the semantic segmentation unit is intended to operate by placing the semantic segmentation label data in the form of tensors where one of the dimensions of the tensor corresponds to multiple probable semantic labels for a single pixel location.
[18]
18. System, according to claim 16, characterized by the fact that the neural network uses a convolutional neural network (CNN) which is a residual network (ResNet).
[19]
19. System, according to claim 13, characterized by the fact that the generation of the current and historical semantically segmented frame comprises the concatenation of the current semantic particularities and the historically influenced semantic particularities to form an input particularity vector with both the current semantic particularities and the historically influenced semantic particularities, to be introduced into the neural network.
[20]
20. Storage medium that has computer-readable stored instructions, characterized in that when these instructions are executed, they cause the execution of any of the methods as defined in any one of claims 1 to 12, which comprise:
obtaining a video sequence of frames of image data and comprising a current frame;
recurrently generate a semantic segmentation map in a view of a current pose of the current frame and comprising obtaining data to form the semantic segmentation map of a 3D semantic segmentation model, in which individual semantic segmentation maps are each , associated with a current frame different from the video sequence;
extract historically influenced semantic features from the semantic segmentation map;
extract current semantic particularities from the current frame;
generate a current and historical semantically segmented frame comprising the use of both the current semantic particularities and the historically influenced semantic particularities as input into a neural network that indicates semantic tags for areas of the current and historical semantically segmented frame; and semantically update the 3D semantic segmentation model including the use of the current and historical semantically segmented frame.
[21]
21. Medium, according to claim 20, characterized by the fact that the generation of the current and historical semantically segmented frame comprises the concatenation of the current semantic particularities and the historically influenced semantic particularities to form an input particularity vector with both the current semantic particularities and the historically influenced semantic particularities, to be introduced into the neural network.
[22]
22. Medium, according to claim 21, characterized by the fact that the instructions cause the computing device to operate by matching the image location of the current semantic particularities and the historically influenced semantic particularities, so that input particularity vectors represent a single area of the image.
[23]
23. Medium, according to claim 22, characterized by the fact that the particularity vectors form a part of the tensors, and the concatenation comprises concatenation of the tensors forming the input particularity vectors; the method comprising the introduction of one matrix at a time into the neural network from the concatenated tensors.
[24]
24. Medium, according to claim 20, characterized by the fact that the 3D semantic segmentation model is geometrically updated by using a red, green, blue depth scheme with simultaneous location and mapping (RGBD-SLAM).
[25]
25. Medium, according to claim 24, characterized by operating to determine a new pose estimate by using both the current image and a rendered image from a previous pose, in which the rendered image is obtained by ray casting projection from a 3D geometric model separate from the 3D semantic segmentation model;
provide the current semantically segmented frame in the new pose estimate; and update the 3D semantic model, which includes registering semantic labels of the current semantically segmented frame in the new pose estimate and in the 3D semantic model.